This code is part of a data analysis that predicts the taxi tips to riders in NYC. It uses historical data coming from the NYC Taxi & Limousine Commission. For this exercise, we use data corresponding to March, June and November 2017. This document presents the data exploration, the issues identified and the solutions applied.
We start by setting the environment and loading the data.
library (dplyr)
library (lubridate)
library (skimr)
library (ggplot2)
library (rgdal)
library (leaflet)
## We load and bind together March, June and November 2017 data.
yellowtripdata <- dplyr::as_tibble(read.csv("data/raw_data/yellow_tripdata_2017-03.csv"))
yellowtripdata <- rbind (yellowtripdata , read.csv("data/raw_data/yellow_tripdata_2017-06.csv"))
yellowtripdata <- rbind (yellowtripdata , read.csv("data/raw_data/yellow_tripdata_2017-11.csv"))
| Field Name | Description |
|---|---|
| VendorID | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle.This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged. |
| RateCodeID | The final rate code in effect at the end of the trip. 1= Standard rate; 2=JFK; 3=Newark; 4=Nassau/Westchester; 5=Negotiated; 6=Group ride |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip; N= not a store and forward trip. |
| Payment_type | A numeric code signifying how the passenger paid for the trip. 1= Credit card; 2= Cash; 3= No charge; 4= Dispute; 5= Unknown; 6= Voided trip |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid in trip. |
| Total_amount | The total amount charged to passengers. Does not include cash tips. |
We validate that data has been properly loaded.
str (yellowtripdata)
## tibble [29,236,424 × 17] (S3: tbl_df/tbl/data.frame)
## $ VendorID : int [1:29236424] 2 2 2 2 2 1 1 1 1 2 ...
## $ tpep_pickup_datetime : chr [1:29236424] "2017-03-09 21:30:11" "2017-03-09 21:47:00" "2017-03-09 22:01:08" "2017-03-09 22:16:05" ...
## $ tpep_dropoff_datetime: chr [1:29236424] "2017-03-09 21:44:20" "2017-03-09 21:58:01" "2017-03-09 22:11:16" "2017-03-10 06:26:11" ...
## $ passenger_count : int [1:29236424] 1 1 1 1 1 1 1 1 1 1 ...
## $ trip_distance : num [1:29236424] 4.06 2.73 2.27 3.86 3.45 2.8 6 8.7 3.7 4.21 ...
## $ RatecodeID : int [1:29236424] 1 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag : chr [1:29236424] "N" "N" "N" "N" ...
## $ PULocationID : int [1:29236424] 148 48 79 237 41 261 87 142 68 261 ...
## $ DOLocationID : int [1:29236424] 48 107 162 41 162 79 142 181 141 163 ...
## $ payment_type : int [1:29236424] 1 2 1 1 2 1 1 1 1 1 ...
## $ fare_amount : num [1:29236424] 14 11.5 10 12 12 12.5 19.5 30 16.5 20.5 ...
## $ extra : num [1:29236424] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
## $ mta_tax : num [1:29236424] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip_amount : num [1:29236424] 3.06 0 2.82 3.99 0 1 3.5 7.8 1.5 4.26 ...
## $ tolls_amount : num [1:29236424] 0 0 0 0 0 0 0 0 0 0 ...
## $ improvement_surcharge: num [1:29236424] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num [1:29236424] 18.4 12.8 14.1 17.3 13.3 ...
We transform the timestamps and categorical variables to a proper format. Then we run the command skim to get a snapshot of the dataframe.
yellowtripdata <- yellowtripdata %>%
mutate (tpep_pickup_datetime = ymd_hms (tpep_pickup_datetime),
tpep_dropoff_datetime = ymd_hms (tpep_dropoff_datetime),
VendorID = factor (VendorID),
RatecodeID = factor (RatecodeID),
store_and_fwd_flag = factor (store_and_fwd_flag),
PULocationID = factor (PULocationID),
DOLocationID = factor (DOLocationID),
payment_type = factor (payment_type)
)
skim (yellowtripdata)
| Name | yellowtripdata |
| Number of rows | 29236424 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 9 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| VendorID | 0 | 1 | FALSE | 2 | 2: 15960277, 1: 13276147 |
| RatecodeID | 0 | 1 | FALSE | 7 | 1: 28407414, 2: 652907, 5: 97859, 3: 61946 |
| store_and_fwd_flag | 0 | 1 | FALSE | 2 | N: 29111958, Y: 124466 |
| PULocationID | 0 | 1 | FALSE | 262 | 237: 1159313, 161: 1097686, 236: 1053298, 186: 1010489 |
| DOLocationID | 0 | 1 | FALSE | 263 | 236: 1090520, 161: 1082557, 237: 1016790, 170: 934023 |
| payment_type | 0 | 1 | FALSE | 5 | 1: 19844531, 2: 9194104, 3: 154080, 4: 43708 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| passenger_count | 0 | 1 | 1.62 | 1.26 | 0.00 | 1.00 | 1.00 | 2.00 | 192.00 | ▇▁▁▁▁ |
| trip_distance | 0 | 1 | 2.92 | 4.48 | 0.00 | 0.97 | 1.60 | 3.01 | 9496.98 | ▇▁▁▁▁ |
| fare_amount | 0 | 1 | 13.11 | 147.28 | -550.00 | 6.50 | 9.50 | 14.50 | 630461.82 | ▇▁▁▁▁ |
| extra | 0 | 1 | 0.33 | 0.46 | -53.71 | 0.00 | 0.00 | 0.50 | 69.80 | ▁▁▇▁▁ |
| mta_tax | 0 | 1 | 0.50 | 0.07 | -0.50 | 0.50 | 0.50 | 0.50 | 140.00 | ▇▁▁▁▁ |
| tip_amount | 0 | 1 | 1.87 | 2.65 | -112.00 | 0.00 | 1.36 | 2.46 | 450.00 | ▅▇▁▁▁ |
| tolls_amount | 0 | 1 | 0.33 | 1.97 | -17.50 | 0.00 | 0.00 | 0.00 | 1018.95 | ▇▁▁▁▁ |
| improvement_surcharge | 0 | 1 | 0.30 | 0.01 | -0.30 | 0.30 | 0.30 | 0.30 | 1.00 | ▁▁▇▁▁ |
| total_amount | 0 | 1 | 16.45 | 147.52 | -550.30 | 8.75 | 11.80 | 17.80 | 630463.12 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| tpep_pickup_datetime | 0 | 1 | 2001-01-01 00:04:13 | 2041-11-15 02:57:16 | 2017-06-14 07:51:59 | 7021748 |
| tpep_dropoff_datetime | 0 | 1 | 2001-01-01 00:04:51 | 2041-11-15 03:12:19 | 2017-06-14 08:05:43 | 7032908 |
| Issues observed | Cause | Plan to resolve it | Observations |
|---|---|---|---|
| Negative $ amounts | payment type | payment_type == 1 OR payment_type == 2 AND force any amount to be >= 0 | |
| max (passenger_count) == 192 / outliers | wrong input | 0 < passenger_count < 6 | link |
| trip_distance outliers | ? | Apply the interquartile range method to remove outliers | |
| fare_amount outliers | ? | Apply the interquartile range method to remove outliers | |
| tip_amount outliers | ? | Apply the interquartile range method to remove outliers | |
| outlier dates e.g. year == 2001 | meter has wrong date setting | year==2017 | |
| tolls_amount outliers | ? | tolls_amount < 125 | The interquantile method seems to fail, probably due to a non-continuous distribution. |
yellowtripdata_clean <- yellowtripdata %>%
filter (payment_type == 1 | payment_type == 2,
trip_distance > 0. ,
passenger_count > 0 & passenger_count < 6,
fare_amount > 0.,
extra >= 0.,
mta_tax >= 0.,
tip_amount >= 0.,
tolls_amount >= 0.,
improvement_surcharge >= 0.,
total_amount > 0. ,
year (tpep_pickup_datetime) == 2017 & year (tpep_dropoff_datetime) == 2017
)
## To remove outliers we apply the interquartile range method
## to the variables: trip_distance, fare_amount and tip_amount.
Interquartile_range_method <- function (df, variable) {
## This function applies the interquartile range method to
## remove outliers from a "variable" in the dataframe "df".
## It returns a dataframe free of outliers.
Q1 <- quantile(variable, .25)
Q3 <- quantile(variable, .75)
IQR <- IQR(variable)
no_outliers <- subset(df, variable> (Q1 - 1.5*IQR) & variable< (Q3 + 1.5*IQR))
return (no_outliers)
}
yellowtripdata_clean <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$trip_distance)
yellowtripdata_clean <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$fare_amount)
yellowtripdata_clean <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$tip_amount)
yellowtripdata_clean <- yellowtripdata_clean %>% filter (tolls_amount < 125.)
skim (yellowtripdata_clean)
| Name | yellowtripdata_clean |
| Number of rows | 24044303 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 9 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| VendorID | 0 | 1 | FALSE | 2 | 2: 12845255, 1: 11199048 |
| RatecodeID | 0 | 1 | FALSE | 7 | 1: 24039893, 5: 2941, 3: 827, 4: 597 |
| store_and_fwd_flag | 0 | 1 | FALSE | 2 | N: 23954727, Y: 89576 |
| PULocationID | 0 | 1 | FALSE | 260 | 237: 1054410, 161: 950254, 236: 946389, 186: 879831 |
| DOLocationID | 0 | 1 | FALSE | 261 | 236: 996775, 161: 954776, 237: 934094, 170: 838995 |
| payment_type | 0 | 1 | FALSE | 2 | 1: 16217689, 2: 7826614, 3: 0, 4: 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| passenger_count | 0 | 1 | 1.49 | 1.02 | 1.00 | 1.00 | 1.00 | 2.00 | 5.00 | ▇▂▁▁▁ |
| trip_distance | 0 | 1 | 1.74 | 1.14 | 0.01 | 0.90 | 1.40 | 2.27 | 6.09 | ▇▇▂▁▁ |
| fare_amount | 0 | 1 | 9.42 | 4.10 | 0.01 | 6.00 | 8.50 | 12.00 | 21.30 | ▁▇▅▂▁ |
| extra | 0 | 1 | 0.32 | 0.37 | 0.00 | 0.00 | 0.00 | 0.50 | 64.20 | ▇▁▁▁▁ |
| mta_tax | 0 | 1 | 0.50 | 0.01 | 0.00 | 0.50 | 0.50 | 0.50 | 7.68 | ▇▁▁▁▁ |
| tip_amount | 0 | 1 | 1.33 | 1.24 | 0.00 | 0.00 | 1.26 | 2.15 | 5.37 | ▇▅▃▁▁ |
| tolls_amount | 0 | 1 | 0.01 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 101.52 | ▇▁▁▁▁ |
| improvement_surcharge | 0 | 1 | 0.30 | 0.00 | 0.00 | 0.30 | 0.30 | 0.30 | 0.30 | ▁▁▁▁▇ |
| total_amount | 0 | 1 | 11.89 | 4.79 | 0.31 | 8.16 | 10.80 | 14.76 | 117.82 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| tpep_pickup_datetime | 0 | 1 | 2017-03-01 | 2017-12-01 23:28:58 | 2017-06-13 23:27:13 | 6742524 |
| tpep_dropoff_datetime | 0 | 1 | 2017-03-01 | 2017-12-02 00:00:29 | 2017-06-13 23:39:01 | 6752360 |
It is strange to observe a p25 of 0 for tip_amount when in the US is very rare that people do not tip a driver. As it is described, cash tips are not included, when customer pay in cash drivers prefer to not or can not record the tip. Let us check it out…
yellowtripdata_clean %>%
ggplot (aes (x= tip_amount)) + geom_histogram(bins = 30) + facet_grid( payment_type~. )
## We need to remove the payments done in cash.
yellowtripdata_clean <- yellowtripdata_clean %>% filter (payment_type == 1)
To finish we add extra features from the timestamps: hour, month, weekday (Sunday =1,…,Saturday = 7) and trip time. From trip time and distance we create a trip_mean_speed [miles/min]. We also remove the outliers from the trip time using the same method as before and visualize the result.
yellowtripdata_clean <- yellowtripdata_clean %>% mutate (
pickup_hour = factor (hour (tpep_pickup_datetime)),
dropoff_hour = factor (hour (tpep_dropoff_datetime)),
pickup_wday = factor (wday (tpep_pickup_datetime)),
dropoff_wday = factor (wday (tpep_dropoff_datetime)),
pickup_month = factor (month (tpep_pickup_datetime)),
dropoff_month = factor (month (tpep_dropoff_datetime)),
trip_time_min = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime , units="min")),
trip_mean_speed = trip_distance/trip_time_min
)
yellowtripdata_clean <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$trip_time_min)
yellowtripdata_clean %>% ggplot (aes (x=trip_time_min)) + geom_histogram(bins = 30)
yellowtripdata_clean %>% ggplot (aes (x = trip_mean_speed)) + geom_histogram(bins =30) + xlab ("trip_mean_speed [miles/min]")
## Warning: Removed 124 rows containing non-finite values (stat_bin).
We ensure that the mean speed of the trip makes sense by required a trip distance larger than 1 mile and longer than 5 min.
yellowtripdata_clean <- yellowtripdata_clean %>% filter (trip_distance > 1 & trip_time_min >5)
yellowtripdata_clean %>% filter (trip_distance > 1 & trip_time_min >5) %>%
ggplot (aes (x = trip_mean_speed)) +
geom_histogram(bins =30) +
xlab ("trip_mean_speed [miles/min]")
To visualize the results we are going to create a NYC divided and represent some results directly in the map using the district numbers provided in the data.
We start by loading a shapefile from NYC taxi zones. This data is freely available at NYC Open Data webpage.
##
ny_taxi_zones <- readOGR("data/NYC Taxi Zones/geo_export_a3336026-bd63-4f88-a657-06200b5a5e82.shp")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS, : Discarded datum WGS84 in CRS definition: +proj=longlat +ellps=WGS84
## +no_defs
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/garoe.gonzalez/Desktop/analysis/data/NYC Taxi Zones/geo_export_a3336026-bd63-4f88-a657-06200b5a5e82.shp", layer: "geo_export_a3336026-bd63-4f88-a657-06200b5a5e82"
## with 263 features
## It has 6 fields
leaflet(ny_taxi_zones) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(popup = ~zone, stroke = TRUE, color = "black", weight = 0.2)
We add the mean tip given by customers that finish their trip in a certain taxi zone. We could have chosen the pickup location or both, but we decided to just use drop-off for simplifying the exercise, and because tip is given then.
## We extract all the info we can get from the neighborhood polygons.
zones_info <-data.frame (borough = ny_taxi_zones$borough,
location_i= ny_taxi_zones$location_i,
zone = ny_taxi_zones$zone)
## We compute the tip mean per drop location ID.
tip_DO_ID <- yellowtripdata %>% group_by (DOLocationID) %>%
summarize (tip_mean = mean (tip_amount)) %>%
mutate (DOLocationID = as.integer (DOLocationID))
## `summarise()` ungrouping output (override with `.groups` argument)
## We join the data back to the neighborhood names.
zones_info<- zones_info %>% left_join(tip_DO_ID, by = c ("location_i" = "DOLocationID"))
## We add the tip information to the shapefile polygons.
ny_taxi_zones$tip_mean <- zones_info$tip_mean
Since we only end up using one payment_type and we have extracted all relevant information from the timestamps, we remove the variables from our dataframe. We also filter out the VendorID which do not bring relevant information about the tip, customer or trip.
yellowtripdata_clean <- yellowtripdata_clean %>% select (-payment_type,-tpep_dropoff_datetime,
-tpep_pickup_datetime, -VendorID)
We have reduced our data from 29236424 to 10876305 events (37%) and we are going to use in the next phase of the analysis 21 variables.
| Issues observed | Action |
|---|---|
| Negative $ amounts | payment_type == 1 AND force any amount to be >= 0 |
| max (passenger_count) == 192 / outliers | 0 < passenger_count < 6 |
| trip_distance outliers and very small | Interquartile range method & > 1 mile |
| fare_amount outliers | Interquartile range method |
| tip_amount outliers | Interquartile range method |
| outlier dates e.g. year == 2001 | year==2017 |
| tolls_amount outliers | tolls_amount < 125 |
| tip_amount == 0 | payment_type == 1 |
| trip_time_min outliers and very small | Interquartile range method & > 5 min |
We save the results in R data objects to be easily loaded by the next phase of the analysis.
saveRDS (yellowtripdata_clean, "data/clean_data/YellowTripCleanData_03_06_11_2017.rds")
saveRDS (ny_taxi_zones, "data/clean_data/ny_taxi_zones_enrich_map.rds")